Day 14 – 加入額外使用者行為特徵 - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天

2025 iThome 鐵人賽

DAY 14

AI & Data

從0開始的MLFLOW應用搭建系列第 14 篇

Day 14 – 加入額外使用者行為特徵

17th鐵人賽

josh_chen40

團隊三陳牛肉吉事堡

2025-09-28 07:30:25

110 瀏覽

分享至

背景與目標

在 Day 13，我們已經建立了 Pipeline，可以一鍵執行「載入資料 → 訓練 TF-IDF → 評估並記錄到 MLflow」。

不過，目前我們的模型只使用了 genre 欄位作為特徵。
👉 如果加入更多特徵，例如動畫的 type（TV、Movie、OVA…），能讓向量化結果更豐富，推薦也更合理。

今天的任務是：

修改 Pipeline，支援 genre + type 特徵組合。
在 MLflow 裡記錄 use_type=True/False 的結果。
透過 MLflow UI 比較不同特徵的影響。

檔案結構

今天我們會修改兩個檔案：

/usr/mlflow/src/pipeline/pipeline.py   # 修改 train_model，加入 use_type 特徵
/usr/mlflow/run_pipeline.py            # 主程式，設定 use_type=True/False

pipeline.py

📂 路徑：/usr/mlflow/src/pipeline/pipeline_v2.py

import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import mlflow

DATA_DIR = "/usr/mlflow/data"

class AnimePipeline:
    def __init__(self, sample_size=1000):
        self.sample_size = sample_size

    def load_data(self):
        """載入動畫資料，只取部分樣本確保 3 分鐘內可跑完"""
        anime = pd.read_csv(os.path.join(DATA_DIR, "anime_clean.csv"))
        ratings_train = pd.read_csv(os.path.join(DATA_DIR, "ratings_train.csv"))

        # 抽樣，避免跑全量太久
        anime = anime.sample(self.sample_size, random_state=42).reset_index(drop=True)
        return anime, ratings_train

    def train_model(self, anime, max_features=1000, ngram_range=(1,1), min_df=2, use_type=True):
        """用 TF-IDF 訓練 item-based 模型，可以選擇是否加入 type 特徵"""

        # ✅ 拼接 genre + type 作為新的特徵
        if use_type:
            anime["features"] = anime["genre"].fillna("") + " " + anime["type"].fillna("")
        else:
            anime["features"] = anime["genre"].fillna("")

        vectorizer = TfidfVectorizer(
            stop_words="english",
            max_features=max_features,
            ngram_range=ngram_range,
            min_df=min_df
        )
        tfidf = vectorizer.fit_transform(anime["features"])
        sim_matrix = cosine_similarity(tfidf)
        return sim_matrix

    def evaluate_and_log(self, anime, sim_matrix, params):
        """簡單評估 Precision@10 並 log 到 MLflow"""

        def precision_at_k(recommended, relevant, k=10):
            return len(set(recommended[:k]) & set(relevant)) / k

        # ✅ 抽樣 30 部動畫做測試，加快速度
        test_idx = np.random.choice(len(anime), 30, replace=False)
        scores = []
        for idx in test_idx:
            sim_scores = list(enumerate(sim_matrix[idx]))
            sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
            top_idx = [i for i, _ in sim_scores[1:11]]
            recommended = anime.iloc[top_idx]["name"].tolist()
            relevant = anime[anime["genre"] == anime.iloc[idx]["genre"]]["name"].tolist()
            if len(relevant) > 1:
                scores.append(precision_at_k(recommended, relevant, k=10))

        avg_precision = np.mean(scores)

        # 👉 mlflow.start_run(): 開始一個新的實驗 run
        with mlflow.start_run(run_name="pipeline-tfidf") as run:
            mlflow.log_params(params)   # 記錄參數
            mlflow.log_metric("precision_at_10", avg_precision)  # 記錄指標

            print("Run ID:", run.info.run_id)
            print("Artifact URI:", run.info.artifact_uri)

        return avg_precision

run_pipeline.py

📂 路徑：/usr/mlflow/day14_run_pipeline_v2.py

import mlflow
from src.pipeline.pipeline_v2 import AnimePipeline

# 👉 mlflow.set_tracking_uri(): 指定要連線的 MLflow Tracking Server
mlflow.set_tracking_uri("http://mlflow:5000")

# 👉 mlflow.set_experiment(): 指定實驗名稱，若不存在會自動建立
mlflow.set_experiment("anime-recsys-pipeline_v2")

def main():
    pipeline = AnimePipeline(sample_size=1000)
    anime, ratings_train = pipeline.load_data()

    # ✅ 可以調整 use_type=True/False，比較兩種特徵的差異
    params = {
        "max_features": 1000,
        "ngram_range": (1,1),
        "min_df": 2,
        "use_type": True
    }

    sim_matrix = pipeline.train_model(anime, **params)
    score = pipeline.evaluate_and_log(anime, sim_matrix, params)

    print(f"Pipeline 完成 ✅ Precision@10 = {score:.4f}")

if __name__ == "__main__":
    main()

執行方式

在 python-dev 容器內執行：

python day14_run_pipeline_v2.py

執行結果：

MLflow UI → Experiments → anime-recsys-pipeline 裡會看到：
- use_type=True 的 run
- 如果改成 use_type=False，就能比較兩種特徵的效果差異。

今天新學到的 MLflow 語法

mlflow.log_params(params)
一次 log 多個參數（例如 max_features, ngram_range, use_type）。
mlflow.log_metric("key", value)
記錄單一數值指標，方便後續在 MLflow UI 排序比較。

流程圖

anime_clean.csv
   │
   ▼
[ genre (+ type) 特徵拼接 ]
   │
   ▼
TF-IDF → Cosine Similarity → Precision@10
   │
   ▼
MLflow Tracking (params + metrics)

重點總結

今天我們在 Pipeline 裡新增了 use_type 參數：
- use_type=False → 只用 genre
- use_type=True → genre + type
透過 MLflow，我們能直接比較兩個 run 的 Precision@10，分析是否有幫助。
整個流程只抽樣 1000 筆動畫 + 測試 30 筆，保證 3 分鐘內完成。